api,coordinator: add drain API orchestration #4762

Open
hongyunyan wants to merge 3 commits into split/pr-4190-2c-coordinator-drain-runtime from split/pr-4190-3-drain-api-rebased

Conversation

@hongyunyan (Collaborator)

What problem does this PR solve?

The remaining part of #4190 is the public drain API and the completion orchestration on top of the runtime layers from #4759, #4760, and #4761. This PR extracts that final layer so reviewers can focus on API semantics, single-session orchestration, and the remaining-work contract without re-reviewing the lower-level protocol and scheduler changes.

Issue Number: ref #4190

What is changed and how it works?

Motivation:

  • Keep public API behavior out of the lower-level protocol and scheduler reviews.
  • Enforce a single active drain session and avoid returning zero remaining before completion is proven.
  • Keep drain target resend and clear-tombstone retry logic together with the API contract.

Summary:

  • Replace the current stub PUT /api/v1/captures/drain implementation with coordinator-backed drain requests.
  • Add in-memory drainSession and drainClearState orchestration in the coordinator.
  • Keep broadcasting active drain targets and clear tombstones until all nodes observe the correct epoch.
  • Aggregate maintainer drain progress and coordinator-side operator state before reporting remaining work.
  • Extend coordinator interfaces and tests for the new drain orchestration path.
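The single-session bookkeeping described above can be sketched as a small epoch-guarded state holder. This is an illustrative model only: the type and method names (drainSession, StartDrain, Finish) are assumptions, not the PR's actual identifiers.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// drainSession (hypothetical name) keeps at most one active drain target,
// identified by a monotonically increasing epoch, as the PR description
// requires for enforcing a single active drain session.
type drainSession struct {
	mu           sync.Mutex
	activeTarget string // empty when no drain is in progress
	targetEpoch  uint64
}

var errDrainInProgress = errors.New("another capture drain is already in progress")

// StartDrain begins a drain for target. A concurrent request for a
// different node is rejected; re-draining the same target returns the
// current epoch instead of starting a new session.
func (s *drainSession) StartDrain(target string) (uint64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.activeTarget != "" && s.activeTarget != target {
		return 0, errDrainInProgress
	}
	if s.activeTarget == "" {
		s.targetEpoch++
		s.activeTarget = target
	}
	return s.targetEpoch, nil
}

// Finish clears the session once drain completion has been proven.
func (s *drainSession) Finish(target string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.activeTarget == target {
		s.activeTarget = ""
	}
}

func main() {
	var s drainSession
	epoch, err := s.StartDrain("capture-1")
	fmt.Println(epoch, err) // 1 <nil>
	_, err = s.StartDrain("capture-2")
	fmt.Println(err) // another capture drain is already in progress
}
```

Bumping the epoch on each new session lets stale heartbeat acknowledgments from an earlier drain be told apart from acknowledgments of the current target.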

How it works:

  • The API validates the target and forwards the request to the coordinator.
  • The coordinator ensures a single active target epoch, requests node liveness drain, tracks progress convergence, and clears the target only after completion is proven.
  • Heartbeat observation and node removal both feed the retry-based target and clear control loops.
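The "only after completion is proven" step above can be sketched as a conservative predicate. The signal names follow the fields visible in the review snippet further down (drainingObserved, stoppingObserved, remaining); the real isDrainCompletionProven also takes the node's liveness state, which this simplified sketch reduces to a boolean, so treat it as illustrative rather than the PR's implementation.

```go
package main

import "fmt"

// isDrainCompletionProven is a simplified sketch: the coordinator only
// reports zero remaining work once every completion signal agrees, so the
// API never claims completion prematurely.
func isDrainCompletionProven(nodeStopping, drainingObserved, stoppingObserved bool, remaining int) bool {
	// Any remaining maintainers, dispatchers, or in-flight operators
	// keep the drain open.
	if remaining > 0 {
		return false
	}
	// Even with zero remaining work, completion also requires the
	// target's liveness transitions to have been observed.
	return nodeStopping && drainingObserved && stoppingObserved
}

func main() {
	// Remaining work is gone but the stopping transition is unobserved:
	// completion is not proven, so callers keep polling.
	fmt.Println(isDrainCompletionProven(true, true, false, 0)) // false
	fmt.Println(isDrainCompletionProven(true, true, true, 0))  // true
}
```

This shape explains the review feedback below: a caller can see remaining == 0 yet still be told the drain is incomplete, which is why logging only when remaining > 0 hides the interesting case.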

Validation note:

  • The targeted drain API and coordinator tests listed below pass.
  • go test ./coordinator in the current environment still hits a fixed-port conflict because *:28300 is already occupied; this failure is unrelated to this PR, and the targeted drain tests pass.

Check List

Tests

  • Unit test
    • go test ./api/v1 ./coordinator/operator ./pkg/server
    • go test -run 'TestDrain|TestDispatcherDrain|TestSetDrain|Test.*Drain' ./coordinator

Questions

Will it cause performance regression or break compatibility?

This PR intentionally changes the v1 drain endpoint from a stub to real coordinator-backed behavior. The implementation keeps the completion check conservative so the API does not return zero remaining until drain completion is proven.

Do you need to update user documentation, design documentation or monitoring documentation?

No additional design or monitoring document changes are needed for this split. The implementation follows the approved drain-capture split design and existing drain design docs.

Release note

Add coordinator-backed v1 capture drain orchestration with remaining-work reporting.

@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 7, 2026

ti-chi-bot bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a61f49b2-c1a6-47c4-864c-97778f6b1c81


@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 7, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements the drainCapture API by introducing an epoch-based drain session management system in the coordinator. It adds mechanisms for broadcasting drain targets, tracking changefeed-level migration progress, and ensuring clean session termination through heartbeat acknowledgments. The review feedback highlights opportunities to improve diagnostic logging when a drain is blocked and to optimize the performance of changefeed status aggregation in large-scale environments.

Comment on lines +98 to +108
if observation.remaining > 0 {
	log.Info("drain completion blocked by remaining work",
		zap.Stringer("targetNodeID", target),
		zap.Uint64("targetEpoch", targetEpoch),
		zap.Int("maintainersOnTarget", observation.maintainersOnTarget),
		zap.Int("inflightOpsInvolvingTarget", observation.inflightOpsInvolvingTarget),
		zap.Int("dispatcherCountOnTarget", observation.dispatcherCountOnTarget),
		zap.Int("targetInflightDrainMoveCount", observation.targetInflightDrainMoveCount),
		zap.Int("pendingStatusCount", observation.pendingStatusCount),
		zap.Int("remaining", observation.remaining))
}

Severity: medium

The log message only triggers when observation.remaining > 0. However, DrainNode can return 1 even when remaining is 0 if other completion signals (like stoppingObserved) are false. This makes it difficult to diagnose why a drain is stuck at 1 from the logs. Consider logging the full observation state whenever !isDrainCompletionProven.

if !isDrainCompletionProven(
		observation.nodeState,
		observation.drainingObserved,
		observation.stoppingObserved,
		observation.remaining,
	) {
		log.Info("drain completion blocked",
			zap.Stringer("targetNodeID", target),
			zap.Uint64("targetEpoch", targetEpoch),
			zap.Stringer("nodeState", observation.nodeState),
			zap.Bool("drainingObserved", observation.drainingObserved),
			zap.Bool("stoppingObserved", observation.stoppingObserved),
			zap.Int("maintainersOnTarget", observation.maintainersOnTarget),
			zap.Int("inflightOpsInvolvingTarget", observation.inflightOpsInvolvingTarget),
			zap.Int("dispatcherCountOnTarget", observation.dispatcherCountOnTarget),
			zap.Int("targetInflightDrainMoveCount", observation.targetInflightDrainMoveCount),
			zap.Int("pendingStatusCount", observation.pendingStatusCount),
			zap.Int("remaining", observation.remaining))
	}

}

targetID := target.String()
cfs := c.changefeedDB.GetReplicating()

Severity: medium

Calling c.changefeedDB.GetReplicating() inside aggregateDrainTargetProgress results in an O(N) scan of all replicating changefeeds on every poll of the DrainNode API. While TiCDC changefeed counts are typically manageable in memory, this could become a performance bottleneck or increase GC pressure if the API is polled frequently in clusters with thousands of changefeeds. Consider whether this progress can be tracked incrementally or cached.
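One possible shape for the incremental tracking suggested in this comment is a per-node counter maintained from the existing assignment paths, so a drain poll becomes an O(1) lookup instead of a full scan. All names here (nodeLoadIndex, OnAssign, OnUnassign, RemainingOn) are hypothetical, not part of the PR.

```go
package main

import "fmt"

// nodeLoadIndex (hypothetical) keeps an incremental count of changefeeds
// assigned per node, updated whenever the scheduler moves a changefeed.
type nodeLoadIndex struct {
	countByNode map[string]int
}

func newNodeLoadIndex() *nodeLoadIndex {
	return &nodeLoadIndex{countByNode: map[string]int{}}
}

// OnAssign and OnUnassign would be called from the existing scheduling
// code paths that already know about every assignment change.
func (i *nodeLoadIndex) OnAssign(node string)   { i.countByNode[node]++ }
func (i *nodeLoadIndex) OnUnassign(node string) { i.countByNode[node]-- }

// RemainingOn answers a drain poll in O(1) instead of scanning all
// replicating changefeeds.
func (i *nodeLoadIndex) RemainingOn(node string) int { return i.countByNode[node] }

func main() {
	idx := newNodeLoadIndex()
	idx.OnAssign("capture-1")
	idx.OnAssign("capture-1")
	idx.OnUnassign("capture-1")
	fmt.Println(idx.RemainingOn("capture-1")) // 1
}
```

The trade-off is that the counter must be updated on every assignment transition, including node removal, or it drifts from the authoritative changefeedDB state.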
